Add modular pipeline for HunyuanVideo 1.5 by akshan-main · Pull Request #13389 · huggingface/diffusers

akshan-main · 2026-04-02T16:09:27Z

What does this PR do?

Adds modular pipeline blocks for HunyuanVideo 1.5 with both text-to-video (HunyuanVideo15Blocks) and image-to-video (HunyuanVideo15Image2VideoBlocks).

Parity verified on Colab G4 GPU:

T2V: MAD 0.000000 vs HunyuanVideo15Pipeline

hv15_t2v_standard.mp4

hv15_t2v_modular.mp4

T2V reproduction code

import gc
import numpy as np
import torch
from diffusers import (
    HunyuanVideo15Pipeline,
    HunyuanVideo15ImageToVideoPipeline,
    HunyuanVideo15Blocks,
    HunyuanVideo15ModularPipeline,
)
from diffusers.utils import load_image, export_to_video

device = "cuda"
dtype = torch.bfloat16

T2V_ID = "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v"
I2V_ID = "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_i2v"

def to_np(x):
    if hasattr(x, "frames"):
        x = x.frames
    if isinstance(x, list):
        x = np.array(x)
    if isinstance(x, torch.Tensor):
        x = x.float().cpu().numpy()
    return x
prompt = "A cinematic drone shot over snowy mountains at sunrise."

print("=== Standard T2V ===")

ref_pipe = HunyuanVideo15Pipeline.from_pretrained(T2V_ID, torch_dtype=dtype).to(device)
g = torch.Generator(device=device).manual_seed(1234)
ref_out = ref_pipe(prompt=prompt, num_frames=55, num_inference_steps=6, generator=g, output_type="np").frames
print(f"Shape: {np.array(ref_out).shape}")
export_to_video(ref_out[0], "/content/hv15_t2v_standard.mp4", fps=24)
del ref_pipe; gc.collect(); torch.cuda.empty_cache()



print("\n=== Modular T2V ===")
blocks = HunyuanVideo15Blocks()
pipe = blocks.init_pipeline(T2V_ID)
pipe.load_components(torch_dtype=dtype)
pipe.to(device)

print("Guider type:", type(pipe.guider).__name__)
print("Guider scale:", pipe.guider.guidance_scale)
print("Guider enabled:", pipe.guider._enabled)
print("Guider num_conditions:", pipe.guider.num_conditions)
g = torch.Generator(device=device).manual_seed(1234)
mod_out = pipe(prompt=prompt, num_frames=55, num_inference_steps=6, generator=g, output="videos", output_type="np")
print(f"Shape: {np.array(mod_out).shape}")
export_to_video(mod_out[0], "/content/hv15_t2v_modular.mp4", fps=24)

diff = np.abs(to_np(ref_out).astype(float) - to_np(mod_out).astype(float)).mean()
print(f"\nT2V MAD: {diff:.6f}")
del pipe, blocks; gc.collect(); torch.cuda.empty_cache()

I2V: MAD 0.000000 vs HunyuanVideo15ImageToVideoPipeline

hv15_i2v_standard.mp4

hv15_i2v_modular.mp4

I2V reproduction code

# Continues from the T2V script above: reuses its imports, device, dtype, I2V_ID, and to_np.
from diffusers.modular_pipelines import HunyuanVideo15Blocks, HunyuanVideo15Image2VideoBlocks, HunyuanVideo15ModularPipeline

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png").convert("RGB")

print("=== Standard I2V ===")
ref_pipe = HunyuanVideo15ImageToVideoPipeline.from_pretrained(I2V_ID, torch_dtype=dtype).to(device)
g = torch.Generator(device=device).manual_seed(1234)
ref_out = ref_pipe(image=image, prompt="A cat turns its head", num_frames=55, num_inference_steps=6, generator=g, output_type="np").frames
print(f"Shape: {np.array(ref_out).shape}")
export_to_video(ref_out[0], "/content/hv15_i2v_standard.mp4", fps=24)
del ref_pipe; gc.collect(); torch.cuda.empty_cache()

print("\n=== Modular I2V ===")
blocks = HunyuanVideo15Image2VideoBlocks()
pipe = blocks.init_pipeline(I2V_ID)
pipe.load_components(torch_dtype=dtype)
pipe.to(device)
g = torch.Generator(device=device).manual_seed(1234)
mod_out = pipe(image=image, prompt="A cat turns its head", num_frames=55, num_inference_steps=6, generator=g, output="videos", output_type="np")
print(f"Shape: {np.array(mod_out).shape}")
export_to_video(mod_out[0], "/content/hv15_i2v_modular.mp4", fps=24)

diff = np.abs(to_np(ref_out).astype(float) - to_np(mod_out).astype(float)).mean()
print(f"\nI2V MAD: {diff:.6f}")
print("\n=== Done ===")

Addresses #13295 (HunyuanVideo 1.5 contribution)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. — Modular Diffusers 🧨 #13295
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@sayakpaul @yiyixuxu @asomoza

akshan-main · 2026-04-06T14:29:35Z

hey guys @yiyixuxu @sayakpaul would greatly appreciate a review!

sayakpaul · 2026-04-06T14:33:33Z

@claude could you do an initial review here?

github-actions · 2026-04-06T14:33:47Z

Claude Code is working…

I'll analyze this and get back to you.

View job run

akshan-main · 2026-04-06T15:49:30Z

@sayakpaul looks like the Claude bot run failed on this one

akshan-main · 2026-04-13T18:49:23Z

friendly ping @sayakpaul @yiyixuxu , awaiting review for modular hunyuanvideo 1.5. Will greatly appreciate a review!

akshan-main · 2026-04-15T16:15:12Z

Hey guys, this is one more of my contributions to modular diffusers @yiyixuxu @sayakpaul

yiyixuxu

thanks!

yiyixuxu · 2026-04-15T21:36:12Z

@akshan-main
thanks for your PR and contribution!
we'll wait to merge until we fix our claude CI and have it do a review as well, as we're currently testing and rolling it out.

HuggingFaceDocBuilderDev · 2026-04-15T21:40:31Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

akshan-main · 2026-04-16T02:48:03Z

@yiyixuxu I think these tests are always failing; they have nothing to do with my PR. Anyways I was wondering if you'd like me to work on something else within or outside modular diffusers. Will be glad to contribute or test out any pipelines.

yiyixuxu · 2026-04-16T19:15:44Z

@akshan-main
how about modular ernie-image?

yiyixuxu · 2026-04-16T19:17:40Z

@claude can you help do a review here? Please pay additional attention to:

look over all the PipelineBlocks to see whether any inputs/intermediate_outputs/expected_components are defined but not used in __call__
compare it with the modular implementations of wan/qwen/flux2 to see if anything is missing

github-actions · 2026-04-16T19:18:00Z

Claude finished @yiyixuxu's task in 6m 50s —— View job

Review: Modular HunyuanVideo 1.5 Pipeline

Read repo guidelines (CLAUDE.md, AGENTS.md, modular.md)
Read all changed files in the PR
Compare with wan/qwen/flux2 modular implementations
Check PipelineBlocks for unused inputs/intermediate_outputs/expected_components
Post final review

Overall the implementation follows the modular pipeline conventions well — block structure, file layout, guider abstraction, and denoise loop patterns are all correct and consistent with wan/flux2/qwen. Parity is verified at MAD 0.000000 for both T2V and I2V, which is great.
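For readers unfamiliar with the metric: MAD here is the mean absolute difference between the frame arrays returned by the standard and modular pipelines, as computed at the end of the reproduction scripts. A minimal NumPy sketch of that check follows. The helper name `mean_abs_diff` and the toy arrays are illustrative assumptions, not code from the PR.

```python
import numpy as np


def mean_abs_diff(reference, candidate):
    """Mean absolute difference between two equally shaped frame arrays."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    return float(np.abs(ref - cand).mean())


# Toy stand-ins for (batch, frames, height, width, channels) video outputs.
rng = np.random.default_rng(1234)
ref_frames = rng.random((1, 4, 8, 8, 3))
same_frames = ref_frames.copy()
shifted_frames = ref_frames + 0.5

identical_mad = mean_abs_diff(ref_frames, same_frames)
shifted_mad = mean_abs_diff(ref_frames, shifted_frames)
print(f"identical: {identical_mad:.6f}, shifted: {shifted_mad:.6f}")
```

A MAD of zero shows that the two outputs match for the tested prompt, seed, and step count; it does not by itself cover other configurations.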

Below are the findings, organized by severity.

Issues to fix

1. Missing __init__.py registrations for HunyuanVideo15Blocks and HunyuanVideo15Image2VideoBlocks

These classes are exported from the local hunyuan_video1_5/__init__.py but are not registered in:

src/diffusers/modular_pipelines/__init__.py
src/diffusers/__init__.py
src/diffusers/utils/dummy_torch_and_transformers_objects.py

This means from diffusers import HunyuanVideo15Blocks (as shown in the PR description) will fail. The wan pipeline registers WanBlocks, Wan22Blocks, etc. in all three files — this PR should do the same.
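One way to confirm whether a name is reachable from a given import path is to resolve it dynamically. The sketch below is an illustration only: `is_exported` is a hypothetical helper, and it is demonstrated on the standard library so it runs without diffusers installed.

```python
import importlib


def is_exported(module_name: str, attr: str) -> bool:
    """Return True if `attr` can be resolved on the imported module."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing, so nothing can be exported from it.
        return False
    return hasattr(module, attr)


# Demonstrated on the standard library so the sketch runs anywhere.
found = is_exported("math", "sqrt")
missing = is_exported("math", "not_a_symbol")
print(found, missing)

# With diffusers installed, the same check could be pointed at the names
# used in the PR description, for example:
#   is_exported("diffusers", "HunyuanVideo15Blocks")
#   is_exported("diffusers.modular_pipelines", "HunyuanVideo15Blocks")
```

Running both commented checks against a local build would show which import path, if any, exposes the block classes.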

2. Test model path uses personal repo (before_denoise.py:49, test_modular_pipeline_hunyuan_video1_5.py:49)

pretrained_model_name_or_path = "akshan-main/tiny-hunyuanvideo1_5-modular-pipe"

Per modular.md gotcha #9: "Tiny test models must live under hf-internal-testing/, not personal repos." The model should be moved to hf-internal-testing/ before merge.

3. HunyuanVideo15TextInputStep declares num_videos_per_prompt as input but never uses it (before_denoise.py:102, 114-118)

The input is declared at line 102:

InputParam.template("num_images_per_prompt", name="num_videos_per_prompt"),

But the __call__ (lines 114–118) only uses batch_size and prompt_embeds — num_videos_per_prompt is never accessed.
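A declared-but-unused input of this kind can be caught with a rough bytecode check. The sketch below is an assumption-laden illustration: `ToyTextInputStep` and its `declared_inputs` list are hypothetical stand-ins, not the diffusers block or its `InputParam` API.

```python
class ToyTextInputStep:
    """Hypothetical stand-in for a pipeline block; not the diffusers class."""

    # Names the block claims to consume.
    declared_inputs = ["batch_size", "prompt_embeds", "num_videos_per_prompt"]

    def __call__(self, state):
        # Only two of the three declared inputs are actually read.
        state.dtype = state.prompt_embeds["dtype"]
        state.total = state.batch_size
        return state


def unused_declared_inputs(block_cls):
    """List declared inputs whose names never appear in __call__'s bytecode."""
    code = block_cls.__call__.__code__
    # co_names holds attribute and global names; co_varnames holds locals.
    referenced = set(code.co_names) | set(code.co_varnames)
    return [name for name in block_cls.declared_inputs if name not in referenced]


unused = unused_declared_inputs(ToyTextInputStep)
print(unused)
```

This is a heuristic: it misses names accessed through string lookups such as `getattr(state, "x")` or inside nested functions, so a manual read of `__call__` is still needed.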

4. Encoder accepts pre-computed embeddings (modular.md gotcha #4) (encoders.py:166-176)

HunyuanVideo15TextEncoderStep accepts prompt_embeds, prompt_embeds_mask, negative_prompt_embeds, negative_prompt_embeds_mask, prompt_embeds_2, prompt_embeds_mask_2, negative_prompt_embeds_2, negative_prompt_embeds_mask_2 as inputs to skip encoding.

Per modular.md:

"In modular pipelines this is unnecessary — users just pop out the encoder block and run it separately. Encoder blocks should only accept raw inputs (prompt, image, etc.)."

The Wan text encoder, for comparison, only accepts prompt, negative_prompt, and max_sequence_length. These pre-computed embedding inputs should be removed.

Suggestions (non-blocking)

5. HunyuanVideo15PrepareLatentsStep accesses components.video_processor without declaring it (before_denoise.py:204)

The block accesses components.video_processor.calculate_default_height_width(...) but video_processor is not in its expected_components. It works in practice because the decoder block registers it, but this is fragile if the block is used standalone. Consider either adding it to expected_components or replacing with pipeline properties.

6. Many TODO: Add description. placeholders remain in auto-generated docstrings (modular_blocks_hunyuan_video1_5.py)

The docstrings for prompt_embeds_2, prompt_embeds_mask_2, negative_prompt_embeds_2, etc., num_frames, and image_latents all show TODO: Add description (see, for example, lines 62–83 of modular_blocks_hunyuan_video1_5.py). While other merged implementations (wan, flux2) also have unfilled TODOs, it would be nice to fill these in, especially for the dual-encoder parameters that are unique to HunyuanVideo 1.5.

7. Code duplication between T2V and I2V denoisers (denoise.py:59-149 vs 238-334)

HunyuanVideo15LoopDenoiser and HunyuanVideo15Image2VideoLoopDenoiser are ~90% identical, differing only in the MeanFlow timestep_r computation (lines 293–301) and the extra timestep_r=timestep_r kwarg. This is not a blocker since the pattern matches other implementations (e.g., Wan has WanLoopDenoiser and Wan22LoopDenoiser), but noting it for awareness.

Block-by-block audit summary

| Block | Unused inputs | Unused outputs | Unused components | Notes |
| --- | --- | --- | --- | --- |
| `HunyuanVideo15TextEncoderStep` | `prompt_embeds*` (gotcha #4) | — | — | Pre-computed embeds violate modular convention |
| `HunyuanVideo15VaeEncoderStep` | — | — | — | Clean |
| `HunyuanVideo15ImageEncoderStep` | — | — | — | Clean |
| `HunyuanVideo15TextInputStep` | `num_videos_per_prompt` | — | — | Declared but never accessed in `__call__` |
| `HunyuanVideo15SetTimestepsStep` | — | — | — | Clean |
| `HunyuanVideo15PrepareLatentsStep` | — | — | — | Uses undeclared `video_processor` |
| `HunyuanVideo15Image2VideoPrepareLatentsStep` | — | — | — | Clean |
| `HunyuanVideo15LoopBeforeDenoiser` | — | N/A | — | Clean |
| `HunyuanVideo15LoopDenoiser` | — | N/A | — | Clean |
| `HunyuanVideo15LoopAfterDenoiser` | — | N/A | — | Clean |
| `HunyuanVideo15Image2VideoLoopDenoiser` | — | N/A | — | Clean |
| `HunyuanVideo15VaeDecoderStep` | — | — | — | Clean |

akshan-main · 2026-04-16T19:31:10Z

@yiyixuxu will address 3 and 4. skipping 1 since earlier feedback on the LTX PR was to only export AutoBlocks from top-level. For 2, can akshan-main/tiny-hunyuanvideo1_5-modular-pipe be transferred?

yiyixuxu · 2026-04-16T19:40:11Z

+        "HunyuanVideo15Blocks",
+        "HunyuanVideo15Image2VideoBlocks",


Suggested change

"HunyuanVideo15Blocks",

"HunyuanVideo15Image2VideoBlocks",

yiyixuxu · 2026-04-16T19:40:38Z

+            HunyuanVideo15Blocks,
+            HunyuanVideo15Image2VideoBlocks,


Suggested change

HunyuanVideo15Blocks,

HunyuanVideo15Image2VideoBlocks,

yiyixuxu · 2026-04-16T19:43:39Z

+
+    @property
+    def expected_components(self) -> list[ComponentSpec]:
+        return [ComponentSpec("transformer", HunyuanVideo15Transformer3DModel)]


indeed missing a video_processor here #13389 (comment)

yiyixuxu · 2026-04-16T19:47:00Z

@akshan-main sounds good, let's try to address #6 too

akshan-main · 2026-04-16T20:12:00Z

> @akshan-main how about modular ernie-image?

on it

yiyixuxu

thanks

* Add modular pipeline support for HunyuanVideo 1.5
* Fix I2V latent/cond spatial dimension mismatch
* Fix guidance_scale default to 7.5 matching ClassifierFreeGuidance
* Fix tokenizer type: use Qwen2TokenizerFast to match model
* Fix system message string formatting to match standard pipeline
* Rewrite HunyuanVideo 1.5 modular: use standard pipeline methods directly
* Remove I2V exports (T2V only for now)
* Fix encoder: use static methods directly instead of encode_prompt
* Inline all standard pipeline methods, remove runtime dependency
* Add HunyuanVideo 1.5 image-to-video modular blocks
* Fix missing FrozenDict import in before_denoise.py
* auto-generated docstrings via #auto_docstring
* Fix ruff lint and format issues
* use InputParam/OutputParam templates and fix ruff
* Address LTX review feedback here like add AutoBlocks, refactor I2V latents, lift encoders
* Add workflow map, workflow tests, auto docstrings, export only AutoBlocks
* Address Claude CI review
* Address claude CI review 2

---------

Co-authored-by: YiYi Xu <yixu310@gmail.com>

akshan-main mentioned this pull request Apr 2, 2026

[modular] Add LTX Video modular pipeline #13378

Merged

6 tasks

github-actions Bot added tests modular-pipelines size/L PR with diff > 200 LOC labels Apr 10, 2026

This was referenced Apr 10, 2026

HunyuanVideo 1.5 I2V image conditioning preprocessed at latent resolution instead of pixel resolution #13439

Closed

Fix HunyuanVideo 1.5 I2V by preprocessing image at pixel resolution i… #13440

Merged

github-actions Bot added utils size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 11, 2026

akshan-main added 13 commits April 10, 2026 21:45

Add modular pipeline support for HunyuanVideo 1.5

48b4f3f

Fix I2V latent/cond spatial dimension mismatch

f029fc5

Fix guidance_scale default to 7.5 matching ClassifierFreeGuidance

5cc2ed1

Fix tokenizer type: use Qwen2TokenizerFast to match model

db41b41

Fix system message string formatting to match standard pipeline

bfbb6b4

Rewrite HunyuanVideo 1.5 modular: use standard pipeline methods directly

aec5b38

Remove I2V exports (T2V only for now)

d9ad0ef

Fix encoder: use static methods directly instead of encode_prompt

5923b5e

Inline all standard pipeline methods, remove runtime dependency

1691e31

Add HunyuanVideo 1.5 image-to-video modular blocks

d189ef1

Fix missing FrozenDict import in before_denoise.py

cfd5de4

auto-generated docstrings via #auto_docstring

acada9b

Fix ruff lint and format issues

6ec688d

yiyixuxu approved these changes Apr 15, 2026

View reviewed changes

yiyixuxu added the close-to-merge label Apr 15, 2026

Merge branch 'main' into modular-hunyuan1.5

458c97c

github-actions Bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 16, 2026

Address Claude CI review

da4f4ef

github-actions Bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 16, 2026

yiyixuxu reviewed Apr 16, 2026

View reviewed changes

Address claude CI review 2

2bfc708

github-actions Bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 16, 2026

akshan-main requested a review from yiyixuxu April 16, 2026 20:08

yiyixuxu approved these changes Apr 16, 2026

View reviewed changes

yiyixuxu merged commit b3889ea into huggingface:main Apr 16, 2026
17 of 18 checks passed

akshan-main mentioned this pull request Apr 17, 2026

Add Ernie-Image modular pipeline #13498

Merged

6 tasks
